Problem definition

Probability of Default (PD) is a fundamental parameter in credit risk. Traditionally, it is estimated with credit scorecards built on logistic regression. However, the scope of information available to the bank differs between clients. If the customer is already a client (e.g. holds a bank account with us), we can access data concerning their behaviour and assets. Hence, we can assume that the estimated probability of default for such a client is more precise than for a comparable customer who does not have an account at our bank.

Therefore, we would like to propose an innovative approach to calculating the Probability of Default, one that can also assess the level of uncertainty of the calculated PD.

Theoretical background and literature review

Using Random Forest for credit risk models

The use of Machine Learning models in credit risk modelling is often questioned, particularly for regulatory purposes, because of the “black box” effect. Nonetheless, Machine Learning techniques are beneficial, as they can improve a model's overall predictive power.

Decision trees are very intuitive models. The basic idea behind a decision tree is to draw all possible decision paths so that they form a tree. Each path from the root to a leaf represents a decision process.

Random forest is based on a simple but powerful concept - the wisdom of crowds.

A random forest consists of a large number of decision trees. Each tree is built on a random sample of the observations, and at each node a random subset of features is considered for the split. When predicting a class variable, a plurality vote across the trees yields the final decision.

Random forests are known to be computationally demanding, but in R you can run the computations on multiple cores in parallel, lowering the computation time.
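As a minimal sketch of this (using the built-in iris data purely for illustration; the parameter values are arbitrary):

```r
library(ranger)

# Fit a probability forest; num.threads grows trees on several cores in parallel.
rf <- ranger(Species ~ ., data = iris,
             num.trees   = 500,   # size of the "crowd"
             probability = TRUE,  # average class probabilities instead of a hard vote
             num.threads = 4)     # parallel tree growing

pred <- predict(rf, data = iris)
head(pred$predictions)  # one probability column per class
```

With `probability = FALSE`, ranger instead returns hard class labels from a plurality vote.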

Pros:
• Limits overfitting
• High accuracy
• Easy choice of relevant variables
• Handles missing values well

Cons:
• Low interpretability
• Parameter choice (number of trees, depth, etc.)

Confidence intervals for random forests

We can estimate confidence intervals for random forests using the approach proposed by researchers from Stanford in the paper “Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife” (Wager et al., 2014). This means that we can estimate standard errors for random forest predictions. The methods proposed by the authors are based on the jackknife and the infinitesimal jackknife.

The infinitesimal jackknife is an alternative to the jackknife where, instead of studying the behavior of a statistic when we remove one observation at a time, we look at what happens to the statistic when we individually down-weight each observation by an infinitesimal amount. When the infinitesimal jackknife is available, it sometimes gives more stable predictions than the regular jackknife.
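For reference, the two estimators from Wager et al. can be written as follows (our transcription; see the paper for the bias-corrected versions). Here $\bar{t}(x)$ is the average prediction of the $B$ trees $t_b(x)$ at a point $x$, $\bar{t}_{(-i)}(x)$ is the average over the trees whose bootstrap sample excludes observation $i$, and $N_{bi}$ is the number of times observation $i$ appears in the $b$-th bootstrap sample:

```latex
% Jackknife-after-bootstrap variance estimate
\hat{V}_{J}(x) = \frac{n-1}{n} \sum_{i=1}^{n}
  \left( \bar{t}_{(-i)}(x) - \bar{t}(x) \right)^{2}

% Infinitesimal jackknife variance estimate
\hat{V}_{IJ}(x) = \sum_{i=1}^{n} \widehat{\mathrm{Cov}}_{i}^{\,2},
\qquad
\widehat{\mathrm{Cov}}_{i} =
  \frac{1}{B} \sum_{b=1}^{B} \left( N_{bi} - 1 \right)
  \left( t_{b}(x) - \bar{t}(x) \right)
```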

Variance of bagged learners

Implementation in R

Libraries

To start with, we need to load the required libraries. For the random forest we’ll be using the ranger package. Then we’ll use ggplot2 to plot the results, with the fantastic theme provided by the director of data science at Airbnb.
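A sketch of the setup, loading only the packages that appear in the code later in this post (the theme package mentioned above is not shown in the original, so it is omitted here):

```r
library(ranger)   # random forests with jackknife standard errors
library(readr)    # read_csv()
library(dplyr)    # %>%, arrange()
library(tibble)   # as_tibble()
library(ggplot2)  # plotting
```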

Experiment #1 - uneven groups

To start, we wanted to see the effect of a different number of observations within each group. If the model behaves as expected, the group with fewer observations should have a wider confidence interval.

To test this, we prepared an artificial dataset. We try to predict the default probability as a function of employment type. The two options are: salaried and self-employed. The analysis is based on historical data, and as such we have an imbalanced set: 10 000 salaried people and only 500 self-employed.
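A sketch of how such a dataset can be simulated and passed to ranger (the default rates of 5% and 25% are our illustrative assumptions, not the values used in the experiment):

```r
library(ranger)
set.seed(42)

# Imbalanced artificial dataset: 10 000 salaried vs 500 self-employed clients.
employment <- factor(c(rep("salaried", 10000), rep("self_employed", 500)))
p_default  <- ifelse(employment == "salaried", 0.05, 0.25)  # assumed true PDs
df <- data.frame(employment = employment,
                 default    = factor(rbinom(length(p_default), 1, p_default)))

# keep.inbag = TRUE is required for the infinitesimal-jackknife standard errors
rf <- ranger(default ~ employment, data = df,
             probability = TRUE, keep.inbag = TRUE)
pred <- predict(rf, data = df, type = "se")

# Average standard error per group - expected to be larger for the small group
tapply(pred$se[, 2], df$employment, mean)
```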

The results are promising - the less populated group does indeed have a much higher margin of error. It also seems consistent with the central limit theorem - the salaried group has 20 times more observations, and its standard error is about 4.5 times smaller, close to the expected factor of √20 ≈ 4.5.

Experiment #2 - even groups

In the previous experiment, the default probability was very different for the two groups. To exclude the possibility of this influencing the variance, we prepared a second experiment. This time the groups are roughly even in size, but the default rates differ.

The data is also artificial - we simulate the influence of marital status on the PD. We have 10 000 observations of married people and 9 000 of singles.

And the results are also promising - the groups seem to have similar margins of confidence, despite having very different probabilities of default.

Experiment #3 - real data

To push the experiments further, we searched for a real dataset. We found a promising database from Taiwan, consisting of 30 000 observations and 23 variables.

After importing the data, cleaning it and removing outliers, we fed it into the model. This time we take a slightly different approach: we run 3-fold cross-validation on this data to obtain the out-of-sample (OOS) probability and variance.

For the plot below, we use a new metric - “model confidence”. Since this is a binary classification problem, we calculate the model confidence as (1 - standard error) * 100%. Thus 100% means that the model is absolutely confident in its prediction, and lower values mean that it starts to lose confidence. Note that we calculate the standard error and PD for each observation individually.

The results are very interesting - the model is very confident when it predicts a PD near 0% or 100%, but it starts to lose confidence for some observations when the predicted probability is closer to 50%.

# Read processed data
# The code for processing can be found on GitHub, not included here for brevity
# Includes outlier removal and data cleaning
data <- read_csv("data/processed/credit_default_clean.csv")

# Remove the id column
data <- data[,-1]

target <- 'didDefault'
data[,target] <- factor(data[[target]])

# This function runs k-fold cross-validation and appends the following columns
# to the original data:
# id        - row number from data
# pred_prob - OOS predicted probability of default
# se        - estimated OOS standard error
rf_cv <- function(data, n_folds = 3){
  form <- formula("didDefault ~ .")
  
  # Randomly assign each row to one of the folds
  folds <- sample(rep(1:n_folds, ceiling(nrow(data) / n_folds)), size = nrow(data))
  results <- list()
  
  for(i in 1:n_folds){
    # Build the model on the training folds; keep.inbag is needed for the
    # infinitesimal-jackknife standard errors
    rf_fit <- ranger(formula = form, 
                     data = na.omit(data[folds != i, ]), 
                     probability = TRUE,
                     keep.inbag = TRUE) 
    
    # Generate predictions and standard errors for the test fold
    # (the cleaned data is assumed NA-free, so na.omit() is only a safeguard)
    pred <- predict(rf_fit, data = na.omit(data[folds == i, ]), type = "se")
    
    results[[i]] <- as_tibble(list(id = which(folds == i), 
                                   pred_prob = pred$predictions[, 2], 
                                   se = pred$se[, 2]))
  }
  results <- do.call(rbind, results)
  results <- results %>% arrange(id)
  results <- cbind(data, results)
  return(results)
}

# Run the CV
res <- rf_cv(data, 3)

res$trueResult <- as.integer(res$didDefault) - 1

# Plot the results
ggplot(res) +
  geom_jitter(aes(x = pred_prob, y = 1 - se, color = 1 - se), alpha = 0.4) +
  scale_color_gradient(high = "orange", low = "purple") +  # built-in colour names stand in for palette objects not shown here
  labs(title = 'Model confidence vs predicted probability of default',
       subtitle = 'Mean probability of default: 26.7%',
       x = 'Probability of default',
       y = 'Model confidence') +
  theme(legend.position = 'none') +
  scale_x_continuous(labels = scales::percent_format(accuracy = 1))

Experiment #4 - missing data

To address feedback received during our pitch, we decided to test how our method deals with missing values among the independent variables. To accomplish that, we artificially removed the values of some variables for 50% of the observations in the test set (20% of all observations). Next, we performed repeated imputation using Predictive Mean Matching from the mice library. For each of the imputed datasets we generated predictions with the Random Forest model fitted on the training set. The final estimate of the standard error is the square root of the sum of two terms: the average variance of the predictions between imputations and the average variance estimated with the infinitesimal jackknife.
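A sketch of this combination step (the helper name `combine_se` and the toy numbers are ours; `pred` is an observations × imputations matrix of predicted PDs, and `se` is the matching matrix of infinitesimal-jackknife standard errors):

```r
# Combine per-imputation predictions and jackknife standard errors into a
# single standard error, as described above:
#   se_total = sqrt( between-imputation variance of predictions
#                    + average jackknife variance )
combine_se <- function(pred, se) {
  between <- apply(pred, 1, var)   # variance of predictions between imputations
  within  <- rowMeans(se^2)        # average infinitesimal-jackknife variance
  sqrt(between + within)
}

# Toy example with 2 observations and 3 imputations:
pred <- matrix(c(0.10, 0.12, 0.11,
                 0.50, 0.40, 0.45), nrow = 2, byrow = TRUE)
se   <- matrix(c(0.02, 0.02, 0.02,
                 0.08, 0.09, 0.07), nrow = 2, byrow = TRUE)
combine_se(pred, se)
```

This mirrors the between/within decomposition of Rubin's rules for multiple imputation, without the small-m correction factor.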

The table below shows the standard error conditional on whether the observation had missing values. We can see that it is larger for the group with missing data.

Standard error depending on whether the observation had missing values

had_na   se
FALSE    0.046
TRUE     0.088

The plot below shows the confidence of the model on the y-axis and the probability of default on the x-axis. We can still see that confidence is lower when the probability is close to 50%, but there is also a strong decrease in confidence for observations with missing values.

Finally, we repeat our previous analysis, where we bin observations by predicted probability of default and, within those bins, compare the expected fraction of defaulters with the true fraction. That way we can see if our standard error estimates provide a good margin of conservatism.
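A sketch of that binning step using dplyr (the simulated stand-in for the `res` frame and the 2-standard-error margin are our assumptions for illustration):

```r
library(dplyr)
set.seed(1)

# Stand-in for the `res` frame produced by rf_cv(), so the sketch is self-contained
res <- tibble(pred_prob  = runif(1000),
              se         = 0.05,
              trueResult = rbinom(1000, 1, pred_prob))

calib <- res %>%
  mutate(bin = cut(pred_prob, breaks = seq(0, 1, by = 0.1),
                   include.lowest = TRUE)) %>%
  group_by(bin) %>%
  summarise(expected = mean(pred_prob),           # expected fraction of defaulters
            upper    = mean(pred_prob + 2 * se),  # conservative bound (assumed 2-SE margin)
            actual   = mean(trueResult),          # true fraction of defaulters
            .groups  = "drop")
calib
```

If `actual` stays below `upper` in every bin, the standard errors provide the conservatism margin described above.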

Missing data - alternative solution

Initially, we were going to use another approach to missing data; however, we have not found a working implementation of the required methods. This approach would omit the imputation step and instead perform on-the-fly imputation during prediction and standard error estimation. On-the-fly imputation is described by Tang, F., & Ishwaran, H. (2017) in the following words:

  1. Only non-missing data is used to calculate the split-statistic for splitting a tree node.
  2. When assigning left and right daughter node membership, if the variable used to split the node has missing data, missing data for that variable is “imputed” by drawing a random value from the inbag non-missing data.
  3. Following a node split, imputed data are reset to missing and the process is repeated until terminal nodes are reached. Note that after terminal node assignment, imputed data are reset back to missing, just as was done for all nodes.
  4. Missing data in terminal nodes are then imputed using OOB non-missing terminal node data from all the trees. For integer valued variables, a maximal class rule is used; a mean rule is used for continuous variables."

The authors claim to implement this method in the R package randomForestSRC, but in our experience it is unstable, and the crashes probably originate in the C layer. Therefore, we were unable to use it successfully.

Potential uses and summary

In the era of Big Data and Artificial Intelligence, banks are eager to adopt evolving techniques, as these help extract better insights from data, reduce costs and increase overall profitability. Without a doubt, using machine learning models is a must in the financial world in order to stay competitive.

Machine Learning and Credit Risk are a suitable marriage - for example, Machine Learning can help in areas where traditional methods disappoint, such as estimating the uncertainty of the probability of default for new clients, for whom not all data is available.

Our method is based on random forests, with confidence intervals estimated using the infinitesimal jackknife. In our analysis we demonstrated that our method meets the requirements set by the client.

  1. Proposed method takes into account the scope of available information
  2. Proposed method takes into account the number of similar observations

Sources

Confidence intervals for Random Forest

Wager, S., Hastie T., & Efron, B. (2014). Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife. J Mach Learn Res 15:1625-1651

Dataset

Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480. https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

Random Forest Missing Data Algorithms

Tang, F., & Ishwaran, H. (2017). Random forest missing data algorithms. Statistical Analysis and Data Mining: The ASA Data Science Journal, 10(6), 363-377. https://arxiv.org/abs/1701.05305